Add support for different models in num_tokens_from_text function #90
vidhula17 wants to merge 4 commits into microsoft:main
Conversation
@microsoft-github-policy-service agree
thinkall left a comment:
Thank you very much @vidhula17 for the PR, nice job! I've left some comments, could you please address them? Let me know if you need any help.
Thanks again for your contribution!
| """Return the number of tokens used by a text for different models.""" | ||
|
|
||
| # Define token counts for known models | ||
| known_models = { |
Why is gpt-3.5-turbo-0301 not in the known models?
| "gpt-4-0613": (3, 1), | ||
| "gpt-4-32k-0613": (3, 1), | ||
| } | ||
|
|
We can add a parameter to the function, say model_token: dict = None, and add the code below to support customizing a model's tokens_per_message without modifying the code here:

if isinstance(model_token, dict):
    known_models.update(model_token)

The parameter can be passed in retrieve_config in autogen/autogen/agentchat/contrib/retrieve_user_proxy_agent.py.
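For illustration, a minimal sketch of how such an override could be wired in; the helper name resolve_token_counts, the constant KNOWN_MODELS, and the (3, 1) default used below are assumptions for the sketch, not the PR's API:

```python
from typing import Dict, Optional, Tuple

# Hypothetical sketch of the suggested model_token override (names and defaults are assumptions).
KNOWN_MODELS: Dict[str, Tuple[int, int]] = {
    "gpt-4-0613": (3, 1),      # (tokens_per_message, tokens_per_name)
    "gpt-4-32k-0613": (3, 1),
}


def resolve_token_counts(model: str, model_token: Optional[dict] = None) -> Tuple[int, int]:
    """Return (tokens_per_message, tokens_per_name), allowing caller-supplied overrides."""
    models = dict(KNOWN_MODELS)
    if isinstance(model_token, dict):
        models.update(model_token)  # register extra models without editing this module
    return models.get(model, (3, 1))


# The override dict could then be threaded through, e.g. from retrieve_config.
print(resolve_token_counts("my-custom-model", {"my-custom-model": (4, 2)}))  # (4, 2)
```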
    if model == "your-new-model-name":
        tokens_per_message = 3
        tokens_per_name = 1
    else:
        raise NotImplementedError(
            f"num_tokens_from_text() is not implemented for model {model}. See "
            f"https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are "
            f"converted to tokens."
        )
Suggested change: replace the if/else block above with plain defaults.

    tokens_per_message = 3
    tokens_per_name = 1
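In effect, the suggestion makes any model not in known_models fall back to common defaults instead of raising; a tiny illustration of that lookup-with-default pattern (values are the same assumptions as above):

```python
known_models = {"gpt-4-0613": (3, 1), "gpt-4-32k-0613": (3, 1)}

# Unknown models fall back to the (3, 1) defaults rather than raising NotImplementedError.
tokens_per_message, tokens_per_name = known_models.get("some-new-model", (3, 1))
print(tokens_per_message, tokens_per_name)  # 3 1
```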
        )

    # Use tiktoken to calculate the number of tokens in the text
    encoding = tiktoken.encoding_for_model(model)
Suggested change: wrap the encoding lookup in a try/except with a fallback.

    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        logger.warning("Warning: model not found. Using cl100k_base encoding.")
        encoding = tiktoken.get_encoding("cl100k_base")
A try...except block is needed here.
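For context, a self-contained sketch of how that fallback and the actual counting could fit together; the function name count_text_tokens and the plain len(encode(...)) accounting are assumptions for the sketch, not the PR's exact implementation:

```python
import logging

import tiktoken

logger = logging.getLogger(__name__)


def count_text_tokens(text: str, model: str = "gpt-3.5-turbo-0613") -> int:
    """Hypothetical minimal counter: tokenize the text with the model's encoding."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unrecognized model names raise KeyError; fall back to a generic encoding.
        logger.warning("Model not found. Using cl100k_base encoding.")
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))


print(count_text_tokens("hello world"))  # 2 with cl100k_base
```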
    with self.assertRaises(NotImplementedError):
        num_tokens_from_text(text, model)
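A self-contained sketch of how this kind of check could sit in a unittest test case; the stand-in num_tokens_from_text below is a hypothetical placeholder so the sketch runs on its own, whereas a real test would import the function from the package:

```python
import unittest


# Hypothetical placeholder; a real test imports num_tokens_from_text from the package.
def num_tokens_from_text(text: str, model: str) -> int:
    known_models = {"gpt-4-0613": (3, 1), "gpt-4-32k-0613": (3, 1)}
    if model not in known_models:
        raise NotImplementedError(f"num_tokens_from_text() is not implemented for model {model}.")
    # Trivial stand-in count: one token per whitespace-separated word.
    return len(text.split())


class TestNumTokensFromText(unittest.TestCase):
    def test_known_model_returns_count(self):
        self.assertGreater(num_tokens_from_text("hello world", "gpt-4-0613"), 0)

    def test_unknown_model_raises(self):
        with self.assertRaises(NotImplementedError):
            num_tokens_from_text("hello world", "some-unknown-model")


if __name__ == "__main__":
    unittest.main()
```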
Also, the code format check failed, please run the formatter locally.
Codecov Report
@@            Coverage Diff             @@
##             main      #90      +/-   ##
==========================================
+ Coverage   41.03%   41.33%   +0.30%
==========================================
  Files          17       17
  Lines        2091     2083       -8
  Branches      469      467       -2
==========================================
+ Hits          858      861       +3
+ Misses       1156     1145      -11
  Partials       77       77
Flags with carried forward coverage won't be shown.
Why are these changes needed?
This PR extends the num_tokens_from_text function to support a wider range of language models beyond the "gpt-x" series. It enhances code flexibility and welcomes community contributions for various models, improving project versatility.
Related issue number
Closes #63
Checks